Week 3: Data Visualization

{ggplot2}

Author: Eunji Kong (adapted from Dr. Joe Nese’s lecture)

Affiliation: University of Oregon

Fall 2025

#install.packages("tidyverse")
#install.packages("palmerpenguins")
#install.packages("patchwork")
#install.packages("ggridges")
#install.packages("gghighlight")
#install.packages("MetBrewer")
#install.packages("ggthemes")


#library(tidyverse)
#library(palmerpenguins)
#library(patchwork)
#library(ggridges)
#library(gghighlight)
#library(MetBrewer)
#library(ggthemes)

Greetings!

Eunji Kong
4th year SPED doctoral student
Finished EDS specialization
EDS project
- Data Viz project
- Capstone

Learning Objectives

Understand the basic syntax requirements for {ggplot2}
Recognize various options for displaying data
Become familiar with various {ggplot2} options/layers
Basically, how to graph and visualize data

Lecture/Material Structure

PDF Lecture Notes
- Include hyperlinks that take you directly to the relevant topics
  - Hyperlinks: everything underlined
.qmd file (recommended)
- Same information as the PDF but allows you to write notes directly in the file
- You can also test out code interactively as you follow along
- You can then render this document as an HTML file for later review
- Visual mode

`{tidyverse}`

{tidyverse} is a meta-package that loads a set of core packages

# If you don't have the package installed
# install.packages("tidyverse")

# load library
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

`{ggplot2}`

gg stands for “grammar of graphics”
Resources
- ggplot2 book
  - email Dr. Nese for digital copy
- Posit cheat sheet
  - can be helpful, perhaps more so after a little experience
- R Graphics Cookbook
- R Graph Gallery
  - past students have really liked this one

Components

Every ggplot has three components:

data
- the data used to produce the plot
aesthetic mappings (aes)
- between variables and visual properties
layer(s)
- usually through the geom_*() function plus various other layers

Template

I use base R’s native pipe |> instead of the {magrittr} pipe %>%; for the code in this lecture they work the same way.

data |> #pipe here
 ggplot(aes(mapping)) + #plus here
 geom_function() +
 additional layers

The code above is equivalent to the code below.

ggplot(data, aes(mapping)) +
geom_function() +
additional layers
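
As a concrete sketch of the two equivalent forms, the example below fills in the template with the mpg data that ships with {ggplot2}. The choice of variables (displ, hwy) and the object names p_piped and p_direct are only illustrative.

```r
library(ggplot2)

# Piped form: the data frame is passed into ggplot() by |>
p_piped <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

# Direct form: the data frame is the first argument of ggplot()
p_direct <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Both objects draw the same scatterplot
p_piped
p_direct
```

Either style works; the piped form is convenient when the data need a few wrangling steps before plotting.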

data

# install.packages("palmerpenguins")

library(palmerpenguins)

head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

# str(penguins)

# glimpse(penguins)

# colnames(penguins)

# View(penguins)

ggplot(aes(mapping))

aesthetic mappings describe how variables in the data are mapped to visual properties
Some visual properties include:
- x
- y
- color (will come back to it)
- fill (will come back to it)
- alpha (will come back to it)
- others (linetype, shape, linewidth, size, group)

penguins |> 
  ggplot(aes(x = bill_length_mm, y = body_mass_g))

QUESTION: What do you see? Why is there nothing plotted?

ANSWER:

Layers

geom_function()

Use a geom_function() to represent data points

Continuous Variable 1
- Only 1 variable: geom_histogram, geom_density
- With continuous Variable 2: geom_point, geom_smooth, geom_line
- With discrete Variable 2: geom_density_ridges (from {ggridges}), geom_boxplot, geom_violin, geom_col
Discrete Variable 1
- Only 1 variable: geom_bar
- With discrete Variable 2: geom_count
Other
- Heatmap: geom_tile

geom_histogram()

General research question: How do the values of my continuous variable vary across its range?

Our data-specific question: What does the distribution of penguin bill lengths (mm) look like in this sample? Are there any outliers? Is it unimodal?

penguins |> 
  ggplot(aes(x = bill_length_mm)) + # Remember to use + instead of |> or %>%
  geom_histogram()
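
geom_histogram() uses 30 bins unless told otherwise and prints a message suggesting you pick a better value. Below is a minimal sketch of the two usual ways to do that. The specific values (binwidth = 1, bins = 20) are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)
library(palmerpenguins)

# Option 1: set the width of each bin in the units of x (here, 1 mm)
p_binwidth <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1)

# Option 2: set the total number of bins
p_bins <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(bins = 20)

p_binwidth
p_bins
```

Trying a few values is worthwhile, because the apparent shape of the distribution can change with the bin size.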

Color vs Fill

penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue")

penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue", 
                 fill = "green")

Color = outline

Fill = area

Transparency

penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(color = "blue", 
                 fill = "green",
                 alpha = 0.2)

Color, fill & alpha in this example are all fixed settings (i.e., they apply to all data points).

More aes mapping

penguins |> 
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), # note that fill is inside  aes()
                 alpha = 0.7)

Fill here is a conditional mapping, meaning that the fill color is different based on the variable (in this case the sex of the birds).

Fixed vs Conditional

penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(fill = "green")

In the above example, where fill is not within aes(), fill is a fixed setting. Also notice that the color name is in quotes.

penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = sex))

In the above example, aes() is used to access variables and make changes according to a specific variable. Here, fill is conditional on the variable sex. Also notice that variables are not in quotes.
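
One way to check the fixed vs conditional distinction without looking at the picture is to inspect the data that {ggplot2} computes for the layer. The sketch below uses layer_data() for that. The object names (p_fixed, p_mapped, fill_fixed, fill_mapped) are made up for illustration, and bins = 30 is spelled out only to silence the default-bins message.

```r
library(ggplot2)
library(palmerpenguins)

p_fixed <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(fill = "green", bins = 30)

p_mapped <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), bins = 30)

# layer_data() returns the data computed for a layer,
# including the fill color assigned to every bar
fill_fixed  <- unique(layer_data(p_fixed)$fill)
fill_mapped <- unique(layer_data(p_mapped)$fill)

fill_fixed   # a single color: the setting applies to every bar
fill_mapped  # several colors: the fill depends on the value of sex
```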

a <- penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(fill = "green")

b <- penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = sex))

#install.packages("patchwork")
library(patchwork)
a+b

Be mindful of aes()

penguins |>
ggplot(aes(x = bill_length_mm))+
geom_histogram(fill = "green")

penguins |>
ggplot(aes(x = bill_length_mm))+
geom_histogram(aes(fill = "green"))

Question: What is wrong with the bottom code? What do you think the plot will look like?

Answer:

penguins |> 
ggplot(aes(x = bill_length_mm)) +
geom_histogram(aes(fill = "green"))

geom_density()

General research question: How does the probability density of my continuous variable vary across its range?

Think of it as a smoothed histogram

Difference: it does not use bins, and the y-axis shows density (relative frequency per unit of x) rather than counts

penguins |>
  ggplot(aes(x = bill_length_mm)) + 
  geom_density()
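
A small check can make "relative frequency per unit of x" more concrete: the area under a density curve is approximately 1, no matter how many penguins are in the sample. The sketch below uses base R's density() to compute a kernel density estimate directly. The names bill, d, and area are illustrative, and the area is a simple grid approximation rather than an exact integral.

```r
library(palmerpenguins)

# Bill lengths with missing values removed
bill <- na.omit(penguins$bill_length_mm)

# Kernel density estimate on an evenly spaced grid of x values
d <- density(bill)

# Approximate the area under the curve: heights times grid spacing, summed
area <- sum(d$y) * (d$x[2] - d$x[1])
area  # close to 1
```

This is why the y-axis values of a density plot are small decimals rather than counts.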

More aes mapping

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex))

Add transparency for clarity

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)

Histogram vs Density

a <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), alpha = 0.5)

b <- penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5)

a + b

Question: What is the difference that you see? When would you use one vs another?

Answer:

facet_wrap

wrap by 1 variable

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(~sex) # remember to use ~

wrap by 2 variables

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(year~sex)

wrap using vars()

penguins |>
  ggplot(aes(x = bill_length_mm)) +
  geom_density(aes(fill = sex), alpha = 0.5) +
  facet_wrap(vars(year,sex))

geom_density_ridges()

geom_density_ridges: two variables

# install.packages("ggridges")

library(ggridges)

penguins |>
  ggplot(aes(bill_length_mm, sex)) +
  geom_density_ridges()

geom_point()

General research question: How are two numeric variables related? (raw observations)

Our data-specific question: What is the relationship between penguins’ bill length and body mass?

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

Add color

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta")

Emphasize specific data points (island = Torgersen)

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue")

penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(data = filter(penguins, island == "Torgersen"), color = "blue") +
  geom_point(color = "magenta")

Question: What happened when we switched the order of the geom_points?

Answer:

Emphasize another way

# install.packages("gghighlight")



penguins |>
  ggplot(aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "magenta") +
  gghighlight::gghighlight(island == "Torgersen")

geom_smooth()

General research question: What is the pattern of the relationship between two continuous variables? (trend)

Our data-specific question: What is the trend or pattern in the relationship between penguins’ bill length and body mass?

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) + 
  geom_smooth()

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Method

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm")

No need to include “x =” or “y =” because aes() treats its first argument as x and its second as y.
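
A quick sketch to confirm the positional matching: aes() only records the mapping and does not need the data, so its names can be inspected directly. The object names positional and named are illustrative.

```r
library(ggplot2)

positional <- aes(bill_length_mm, body_mass_g)
named      <- aes(x = bill_length_mm, y = body_mass_g)

names(positional)  # "x" "y"
names(named)       # "x" "y"
```

Other aesthetics (color, fill, and so on) are not matched by position, so they always need to be named.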

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", level = .65)

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", se=FALSE)

Note: This is not the same as geom_line(). geom_smooth() fits a line of best fit to the data; it does not connect the raw observations.
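
To see what "line of best fit" means here, the sketch below fits the same straight-line model directly with lm() and then draws it with geom_smooth(method = "lm"). The names fit and p_lm are illustrative; formula = y ~ x is the default and is written out only to make the model explicit.

```r
library(ggplot2)
library(palmerpenguins)

# The same straight-line model, fit directly with lm()
fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)
coef(fit)  # intercept and slope of the best-fit line

# The plot layer that draws this line (no confidence band)
p_lm <- penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_lm
```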

Adding Layers

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

Global

If we map an aesthetic in the first aes() (inside ggplot()), such as color = species, it carries through to all additional layers.

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) + # color = species
  geom_point() +
  geom_smooth(method = "lm")

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", aes(color = species))

Local

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + # color = species
  geom_smooth(method = "lm")
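
The practical difference between a global and a local mapping shows up in how many lines geom_smooth() fits. The sketch below counts them with layer_data(); the object names (p_global, p_local, n_lines_global, n_lines_local) are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Global: color = species is inherited by both layers
p_global <- penguins |>
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local: color = species applies to the points only
p_local <- penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is geom_smooth(); count how many separate lines it fits
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

n_lines_global  # 3: one line per species
n_lines_local   # 1: a single line for all penguins
```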

geom_line()

geom_point: raw observations, not linked
geom_smooth: trend/pattern
geom_line: raw data linked

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  geom_line()

When should you use line plots?

Usually when time is involved
One data point per time point for each line or group
Shows linkage

# Original data
head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

# Create new data set so that there is only one data point per year
penguins_year <- penguins |>
  group_by(year) |> 
  summarize(avg_bill = mean(bill_length_mm, na.rm=TRUE))
head(penguins_year)

# A tibble: 3 × 2
   year avg_bill
  <int>    <dbl>
1  2007     43.7
2  2008     43.5
3  2009     44.5

penguins_year |>
  ggplot(aes(year, avg_bill)) + 
  geom_line()

# Create new data set so that there is one data point for each year for each island
penguins_year_species <- penguins |> 
  group_by(year, island) |> 
  summarize(avg_bill = mean(bill_length_mm, na.rm=TRUE))

head(penguins_year_species)

# A tibble: 6 × 3
# Groups:   year [2]
   year island    avg_bill
  <int> <fct>        <dbl>
1  2007 Biscoe        45.0
2  2007 Dream         44.5
3  2007 Torgersen     38.8
4  2008 Biscoe        44.6
5  2008 Dream         43.8
6  2008 Torgersen     38.8

penguins_year_species |>
  ggplot(aes(year, avg_bill, group = island, color = island)) + 
  geom_line()

geom_boxplot()

General research question: How is a continuous variable distributed across groups, and how do the medians, quartiles, and potential outliers compare?

penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_boxplot()

geom_violin()

General research question: How is the full distribution of a continuous variable shaped across groups?

penguins |>
  ggplot(aes(species, body_mass_g)) +
  geom_violin()

geom_bar()

geom_bar() vs geom_col()

geom_bar()	geom_col()
1 discrete variable	2 variables: 1 continuous, 1 discrete (unique)
Counts rows: the height of each bar is proportional to the number of cases in each group	Needs a variable with numbers in your data (average, proportion, count)

penguins |> 
  ggplot(aes(species)) + # one variable in the `aes()`
  geom_bar()

geom_col()

summarized_penguins <- penguins |> 
  group_by(species) |> 
  summarize(N = n())

head(summarized_penguins)

# A tibble: 3 × 2
  species       N
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

summarized_penguins |>
  ggplot(aes(species, N)) +
  geom_col()
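
To confirm that geom_bar() on the raw data and geom_col() on pre-counted data draw bars of the same height, the sketch below compares the computed heights with layer_data(). The counts are rebuilt with base R's table() so the example runs on its own; the names counts, bar_heights, and col_heights are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Count rows per species by hand; geom_col() needs this numeric column
counts <- as.data.frame(table(species = penguins$species), responseName = "N")

# geom_bar() counts the rows itself; geom_col() uses the N column as given
bar_heights <- layer_data(ggplot(penguins, aes(species)) + geom_bar())$y
col_heights <- layer_data(ggplot(counts, aes(species, N)) + geom_col())$y

bar_heights
col_heights
all.equal(as.numeric(bar_heights), as.numeric(col_heights))
```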

More aes mapping

summarized_penguins2 <- penguins |>
  group_by(species, sex) |>
  na.omit() |> 
  summarize(bill_length_avg = mean(bill_length_mm))

summarized_penguins2

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    bill_length_avg
  <fct>     <fct>            <dbl>
1 Adelie    female            37.3
2 Adelie    male              40.4
3 Chinstrap female            46.6
4 Chinstrap male              51.1
5 Gentoo    female            45.6
6 Gentoo    male              49.5

ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex))

Position

ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex), position = "dodge")

coord_flip

ggplot(summarized_penguins2, aes(species, bill_length_avg)) +
  geom_col(aes(fill = sex), position = "dodge") +
  coord_flip()

geom_count()

General research question: How many observations fall in each category pair?

Our data-specific question: How many of each species live on each island?

penguins |>
  ggplot(aes(species, island)) +
  geom_count()

Scales

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_gradient(low = "lightblue", high = "brown")

What do scales do?

Scales control how the mappings you added to aes are displayed (e.g., color range, size range, breaks and labels, range or limits)

Template: scale_*

Available for most aes mappings: x, y, size, color, fill, linetype, alpha, etc.
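
As a sketch of the scale_* template, the example below adjusts two mappings at once: scale_x_continuous() for the x mapping and scale_color_manual() for the color mapping. The breaks, limits, and the three color names are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_scaled <- penguins |>
  ggplot(aes(bill_length_mm, body_mass_g, color = species)) +
  geom_point() +
  # scale for the x mapping: axis title, tick positions, and limits
  scale_x_continuous(name = "Bill length (mm)",
                     breaks = seq(30, 60, by = 10),
                     limits = c(30, 60)) +
  # scale for the color mapping: which color each species gets
  scale_color_manual(values = c(Adelie = "darkorange",
                                Chinstrap = "purple",
                                Gentoo = "cyan4"))
p_scaled
```

The pattern is always scale_, then the aesthetic name, then the type of scale.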

Colorblind friendly

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_viridis_c()

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_viridis_c(option = "turbo") # magma, inferno, plasma, viridis, cividis, rocket, mako, turbo or A-H

{MetBrewer} - inspired by art in the Met

#install.packages("MetBrewer")

penguins |>
  ggplot(aes(species, island)) +
  geom_count(aes(color = after_stat(n)))+
  scale_color_gradientn(colors=MetBrewer::met.brewer("Isfahan1"))

geom_tile()

General research question: What’s the value of a numerical measure (Z) for each (X, Y) pair? In our example, Z is the correlation for each pair of variables.

corr <- penguins |>
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) |>
  drop_na() |>                
  cor()

pc <- corr |> 
  as.data.frame() |> 
  rownames_to_column(var = "row") |> 
  pivot_longer(
    cols = -row,
    names_to = "col",
    values_to = "cor")

head(pc)

# A tibble: 6 × 3
  row            col                  cor
  <chr>          <chr>              <dbl>
1 bill_length_mm bill_length_mm     1    
2 bill_length_mm bill_depth_mm     -0.235
3 bill_length_mm flipper_length_mm  0.656
4 bill_length_mm body_mass_g        0.595
5 bill_depth_mm  bill_length_mm    -0.235
6 bill_depth_mm  bill_depth_mm      1

ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile()

ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(pc, aes(row, col, fill = cor)) +
  geom_tile() +
  scale_fill_viridis_c()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
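
Because correlations range from -1 to 1 with a meaningful zero, a diverging color scale is another option worth trying. The sketch below is one possible version using scale_fill_gradient2(); the blue/white/red colors are an arbitrary choice. The long correlation table is rebuilt with base R (as.table()) only so that the example runs on its own, and the names vars, corr_mat, pc_long, and p_div are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Rebuild the long correlation table with base R so the sketch runs on its own
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
corr_mat <- cor(na.omit(penguins[, vars]))
pc_long <- as.data.frame(as.table(corr_mat), responseName = "cor")
names(pc_long)[1:2] <- c("row", "col")

p_div <- ggplot(pc_long, aes(row, col, fill = cor)) +
  geom_tile() +
  # diverging palette: negative = blue, zero = white, positive = red
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p_div
```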

Other Layers

labels

axis labels

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species))

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)")

title

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass")

subtitle

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species")

caption

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins")

tag

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)")

legend (one way)

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color="SPECIES!")

theme

The default is theme_gray(). There are a lot of built-in alternatives in {ggplot2}. My go-to is theme_minimal() because it is clean without a lot of unnecessary visuals.

If you want to set the theme globally (meaning for all the graphs in your document), add theme_set(theme_minimal()) on the first line after you load your libraries.
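
A minimal sketch of the global approach is below. theme_set() returns the previous default, so it can be saved and restored later; the name old_theme is illustrative, and restoring is optional.

```r
library(ggplot2)
library(palmerpenguins)

# theme_set() changes the default theme and returns the previous one
old_theme <- theme_set(theme_minimal())

# No theme_*() layer needed: this plot picks up theme_minimal() automatically
penguins |>
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species))

# Optional: restore the previous default
theme_set(old_theme)
```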

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x="Bill length (mm)",
       y="Body mass (g)",
       title = "Relationship between bill length and body mass",
       subtitle="Grouped by species",
       caption = "palmerpenguins",
       tag = "(A)",
       color="SPECIES!") +
  theme_minimal()

Other packages:

#install.packages("ggthemes")

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color=species)) + 
  ggthemes::theme_economist() +
  ggthemes::scale_color_economist()

penguins |> 
  ggplot(aes(bill_length_mm, body_mass_g)) +
  geom_point(aes(color = species)) + 
  labs(x        = "Bill length (mm)",
       y        = "Body mass (g)",
       title    = "Relationship between bill length and body mass",
       subtitle = "Grouped by species",
       caption  = "palmerpenguins",
       tag      = "(A)",
       color    = "SPECIES!") +
  theme(plot.title        = element_text(size=13, face="bold", hjust =0.5), 
        axis.title        = element_text(size=11, family="Georgia"), 
        axis.text.x       = element_text(size=10, angle = 45, hjust=1),
        panel.background  = element_rect(fill = "grey95"),
        plot.background   = element_rect(fill = "white"),
        panel.grid.major  = element_line(color = "black"),
        panel.grid.minor  = element_blank(),
        legend.position   = "top",
        legend.title      = element_text(face="bold"),
        legend.background = element_rect(fill = "transparent"))

Practice together

1

Get to know the data - str(mpg) or head(mpg)

2

What is the overall distribution of city fuel efficiency (mpg) across car models?

3

How does the distribution vary by drivetrain type (e.g., front-, rear-, 4-wheel drive)?

4

What is the relationship between city and highway mpg?

5

Can we focus on/emphasize Audi’s relationship?

6

Can we have larger points for clarity?

7

How are the city/hwy mpg relationships different by car class?

8

Too much clutter. Can we just see trends?

9

Still too much clutter. Is there a better way to see each trend clearly?

10

Can we make it colorblind friendly?

11

Can we clarify axis and legend labels?

12

Can we polish the appearance with a theme?